Abstract


In this work we show how to represent policies as programs: that is,
as stochastic simulators with tunable parameters. To learn the
parameters of such policies we develop connections between black box
variational inference and existing policy search approaches. We then
explain how such learning can be implemented in a probabilistic
programming system. Using our own novel implementation of such a
system we demonstrate both conciseness of policy representation and
automatic policy parameter learning for a set of canonical
reinforcement learning problems.


1 Introduction 


In planning under uncertainty the objective is to find a policy that
selects actions, given currently available information, in a way that
maximizes expected reward. In many cases an optimal policy can neither
be represented compactly nor learned exactly. Online approaches to
planning, such as Monte Carlo Tree Search [Kocsis and Szepesvari,
2006], are nonparametric policies that select actions based on
simulations of future outcomes and rewards, also known as rollouts.
While policies like this are often able to achieve near optimal
performance, they are computationally intensive and do not have
compact parameterizations. Policy search methods (see Deisenroth et
al. [2011] for a review) learn parameterized policies offline, which
can then be used without performing rollouts at test time, trading off
improved test-time computation against having to choose a policy
parameterization that may be insufficient to represent the optimal
policy.


In this work we show how probabilistic programs can represent
parametric policies in a manner that is both more general and more
compact. We also develop automatic inference techniques for
probabilistic programming systems to do model-agnostic policy search.
Our proposed approach, which we call black box policy learning (BBPL),
is a variant of Bayesian policy search [Wingate et al., 2011, 2013] in
which policy learning is cast as stochastic gradient ascent on the
marginal likelihood.

In contrast to languages that target a single domain-specific
algorithm [Andre and Russell, 2002; Srivastava et al., 2014; Nitti et
al., 2015], our formulation emphasizes the use of general-purpose
techniques for Bayesian inference, in which learning is used for
inference amortization. To this end, we adapt black-box variational
inference (BBVI), a technique for approximation of the Bayesian
posterior [Ranganath et al., 2014; Wingate and Weber, 2013], to
perform (marginal) likelihood maximization in arbitrary programs. The
resulting technique is general enough to allow implementation in a
variety of probabilistic programming systems. We show that this same
technique can be used to perform policy search under an appropriate
planning as inference interpretation, in which a Bayesian model is
weighted by the exponentiated reward. The resulting technique, BBPL,
is closely related to classic policy gradient methods such as
REINFORCE [Williams, 1992].


We present case studies in the Canadian traveler problem and the
RockSample domain, and introduce a setting inspired by Guess Who
[Coster and Coster, 1979] as a benchmark for optimal diagnosis
problems.


2 Policies as Programs 


Probabilistic programming systems [Milch et al., 2007; Goodman et al.,
2008; Minka et al., 2014; Pfeffer, 2009; Mansinghka et al., 2014; Wood
et al., 2014; Gordon et al., 2014] represent generative models as
programs in a language that provides specialized syntax to instantiate
random variables, as well as syntax to impose conditions on these
random variables. The goal of inference in a probabilistic program is
to characterize the distribution on its random variables subject to
the imposed conditions, which is done using one or more generic
methods provided by an inference back end.
(defquery ctp
  "Probabilistic program representing an agent
  solving the Canadian Traveler Problem"
  [graph src tgt base-prob make-policy]
  (let [sub-graph
        (sample-weather graph base-prob src tgt)
        [path dist counts]
        (dfs-agent sub-graph src tgt (make-policy))]
    (factor (- dist))
    (predict :path path)
    (predict :distance dist)
    (predict :counts counts)))

(defm dfs-agent
  "Run depth-first-search from start to target,
  prioritizing edges according to policy"
  [graph start target policy]
  ...)

(defm make-random-policy
  "Policy: Select edge at random"
  []
  (fn policy [u vs]
    (sample
      (categorical
        (zipmap vs (repeat (count vs) 1.))))))

(defm make-edge-policy
  "Policy: learn priorities for each edge"
  []
  (let [Q (mem (fn [u v]
                 (sample [u v]
                   (tag :policy
                     (gamma 1. 1.)))))]
    (fn policy [u vs]
      (argmax
        (zipmap vs (map (fn [v] (Q u v)) vs))))))




Figure 1: A Canadian traveler problem (CTP) implementation in Anglican. In the CTP, an agent must travel 
along a graph, which represents a network of roads, to get from the start node (green) to the target node (red). 
Due to bad weather some roads are blocked, but the agent does not know which in advance. Upon arrival at each 
node the agent observes the set of open edges. The function dfs-agent walks the graph by performing depth-first 
search, calling a function policy to choose the next destination based on the current and unvisited locations. 
The function make-random-policy returns a policy function that selects destinations uniformly at random, whereas 
make-edge-policy constructs a policy that selects according to sampled edge preferences (Q u v). By learning a 
distribution on each value (Q u v) through gradient ascent on the marginal likelihood, we obtain a heuristic offline 
policy that follows the shortest path when all edges are open, and explores more alternate routes as more edges 
are closed. 


In sequential decision problems we must define a stochastic simulator
of an agent, which chooses actions based on current contextual
information, and a stochastic simulator of the world, which may have
some internal variables that are opaque to the agent, but provides new
contextual information after each action. For sufficiently simple
problems, both the agent and the world simulator can be adequately
described as graphical models. Here we are interested in using
probabilistic programs as simulators of both the world and the agent.
The trade-off made in this approach is that we can incorporate more
detailed assumptions about the structure of the problem into our
simulator of the agent, which decreases the size of the search space,
at the expense of having to treat these simulators as black boxes from
the perspective of the learning algorithm.

In Figure 1 we show an example of a program, written in the language
Anglican [Wood et al., 2014], which simulates an agent in the Canadian
traveler problem (CTP) domain. This agent traverses a graph using
depth first search (DFS) as a base strategy, choosing edges either at
random, or according to sampled preferences. Probabilistic programs
can describe a family of algorithmic policies, which may make use of
programming constructs such as recursion, higher-order functions, and
arbitrary deterministic operations. This allows us to define
structured policies that enforce basic constraints, such as the rule
that you should never travel the same edge twice.


Given a base policy program, we can define different parametrizations
that encode additional structure, such as the typical travel distance
starting from each edge. We can then formulate a Bayesian approach to
policy learning, in which we place a prior on the policy parameters
and optimize its hyperparameters to maximize the reward. To do so we
employ a planning as inference interpretation [Toussaint et al., 2006;
Rawlik et al., 2012; Neumann, 2011; Hoffman et al., 2009a,b; Levine
and Koltun, 2013] that casts policy search as stochastic gradient
ascent on the marginal likelihood.


A challenge in devising methods for approximate inference in
probabilistic programs is that such methods must deal gracefully with
programs that may not instantiate the same set of random variables in
each execution. For example, the random policy in Figure 1 will
generate a different set of categorical variables in each execution,
depending on the path followed through the graph. Similarly, the edge
based policy samples values (Q u v) lazily, depending on the visited
nodes.
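To make the varying set of random variables concrete, the following Python sketch mimics the memoized edge preferences of make-edge-policy. It is an illustration under our own assumptions rather than the Anglican implementation: the names make_edge_policy, q, and prefs are hypothetical, and a plain dictionary stands in for mem.

```python
import random


def make_edge_policy(rng):
    """Return (policy, prefs); prefs is filled lazily as edges are queried."""
    prefs = {}  # memo table: (u, v) -> sampled preference

    def q(u, v):
        # Sample a Gamma(1, 1) preference the first time an edge is seen,
        # and reuse the stored value on every later query.
        if (u, v) not in prefs:
            prefs[(u, v)] = rng.gammavariate(1.0, 1.0)
        return prefs[(u, v)]

    def policy(u, vs):
        # Choose the candidate destination with the highest preference.
        return max(vs, key=lambda v: q(u, v))

    return policy, prefs


rng = random.Random(0)
policy, prefs = make_edge_policy(rng)
first = policy("a", ["b", "c"])
second = policy("a", ["b", "c"])  # same choice: preferences are memoized
```

After these two calls only the edges leaving node "a" have preferences in prefs; edges that the agent never considers are never instantiated, which is the situation the inference method has to handle.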


In this paper we develop an approach to policy learning based on black
box variational inference (BBVI) [Ranganath et al., 2014; Wingate and
Weber, 2013], a technique for variational approximation of the
posterior in Bayesian models. We begin by reviewing planning as
inference formulations of policy search. We then show how BBVI can be
adapted to perform hyperparameter optimization. In a planning as
inference interpretation this method, which we call black box policy
learning (BBPL), is equivalent to classic policy gradient methods. We
then describe how BBPL may be implemented in the context of
probabilistic programs with varying numbers of random variables, and
provide a language-agnostic definition of the interface between the
program and the inference back end.


3 Policy Search as Bayesian Inference


In sequential decision problems, an agent draws an action $u_t$ from a
policy distribution $\pi(u_t \mid x_t)$, which may be deterministic,
conditioned on a context $x_t$. The agent then observes a new context
$x_{t+1}$ drawn from a distribution $p(x_{t+1} \mid u_t, x_t)$. In the
finite horizon case, an agent performs a fixed number of actions $T$,
resulting in a sequence
$\tau = (x_0, u_0, x_1, u_1, x_2, \ldots, u_{T-1}, x_T)$, which is
known as a trajectory, or roll-out. Each trajectory gets a reward
$R(\tau)$. Policy search methods maximize the expected reward
$J_\theta = \mathbb{E}_{p_\theta}[R(\tau)]$ for a family of stochastic
policies $\pi_\theta$ with parameters $\theta$

$$J_\theta = \int R(\tau)\, p_\theta(\tau)\, d\tau, \tag{1}$$

$$p_\theta(\tau) := p(x_0) \prod_{t=0}^{T-1} \pi(u_t \mid x_t, \theta)\, p(x_{t+1} \mid u_t, x_t). \tag{2}$$

We are interested in performing upper-level policy search, a variant
of the problem defined in terms of the hyperparameters $\lambda$ of a
distribution $p_\lambda(\theta)$ that places a prior on the policy
parameters

$$J_\lambda = \int R(\tau)\, p_\lambda(\tau, \theta)\, d\tau\, d\theta, \tag{3}$$

$$p_\lambda(\tau, \theta) := p_\lambda(\theta)\, p(\tau \mid \theta). \tag{4}$$

Upper-level policy search can be interpreted as maximization of the
normalizing constant $Z_\lambda$ of an unnormalized density

$$\gamma_\lambda(\tau, \theta) = p_\lambda(\tau, \theta) \exp(\beta R(\tau)), \tag{5}$$

$$Z_\lambda = \int \gamma_\lambda(\tau, \theta)\, d\tau\, d\theta \tag{6}$$

$$\phantom{Z_\lambda} = \mathbb{E}_{p_\lambda}[\exp(\beta R(\tau))]. \tag{7}$$

The constant $\beta > 0$ has the interpretation of an 'inverse
temperature' that controls how strongly the density penalizes
sub-optimal actions. The normalization constant $Z_\lambda$ is the
expected value of the exponentiated reward $\exp(\beta R(\tau))$,
which is known as the desirability in the context of optimal control
[Kappen, 2005; Todorov, 2009]. It is not a priori obvious that
maximization of the expected reward $J_\lambda$ yields the same policy
hyperparameters as maximization of the desirability $Z_\lambda$, but
it turns out that the two are in fact equivalent, as we will explain
in Section 5.

In planning as inference formulations,
$\gamma_\lambda(\tau, \theta)/Z_\lambda$ is often interpreted as a
posterior $p_\lambda(\tau, \theta \mid r)$ conditioned on a pseudo
observable $r = 1$ that is Bernoulli distributed with probability
$p(r = 1 \mid \tau) \propto \exp(\beta R(\tau))$, resulting in a joint
distribution that is proportional to $\gamma_\lambda(\tau, \theta)$,

$$p(r = 1, \tau, \theta) \propto p_\lambda(\tau, \theta) \exp(\beta R(\tau)) = \gamma_\lambda(\tau, \theta). \tag{8}$$

Maximization of $Z_\lambda$ is then equivalent to the maximization of
the marginal likelihood $p_\lambda(r = 1)$ with respect to the
hyperparameters $\lambda$. In a Bayesian context this is known as
empirical Bayes (EB) [Maritz and Lwin, 1989], or type II maximum
likelihood estimation.
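As a small numerical illustration of the two objectives, the sketch below evaluates the expected reward and the desirability in closed form for a hypothetical one-step problem in which the policy parameter directly selects one of two actions. The problem, the reward values, and the function names are assumptions made for this example only; it illustrates the definitions and is not evidence for the general equivalence.

```python
import math


def expected_reward(lam, rewards):
    """J_lambda when theta ~ Categorical(lam) and R = rewards[theta]."""
    return sum(p * r for p, r in zip(lam, rewards))


def desirability(lam, rewards, beta=1.0):
    """Z_lambda = E[exp(beta * R)] under the same prior."""
    return sum(p * math.exp(beta * r) for p, r in zip(lam, rewards))


rewards = [0.0, 1.0]  # action 1 is the better action
for p1 in (0.1, 0.5, 0.9):
    lam = [1.0 - p1, p1]
    print(p1, expected_reward(lam, rewards), desirability(lam, rewards))
```

In this toy setting both quantities increase as the prior places more mass on the better action, so both are maximized by the same hyperparameters.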


4 Black-box Variational Inference 


Variational Bayesian methods [Wainwright and Jordan, 2008] approximate
an intractable posterior with a more tractable family of
distributions. For purposes of exposition we consider the case of a
posterior $p(z, \theta \mid y)$, in which $y$ is a set of
observations, $\theta$ is a set of model parameters, and $z$ is a set
of latent variables. We write
$p(z, \theta \mid y) = \gamma(z, \theta)/Z$ with

$$\gamma(z, \theta) = p(y \mid z, \theta)\, p(z \mid \theta)\, p(\theta), \tag{9}$$

$$Z = \int \gamma(z, \theta)\, dz\, d\theta. \tag{10}$$

Variational methods approximate the posterior using a parametric
family of distributions $q_\lambda$ by maximizing a lower bound on
$\log Z$ with respect to $\lambda$

$$\mathcal{L}_\lambda = \mathbb{E}_{q_\lambda}[\log \gamma(z, \theta) - \log q_\lambda(z, \theta)] \tag{11}$$

$$\phantom{\mathcal{L}_\lambda} = \log Z - D_{\mathrm{KL}}(q_\lambda(z, \theta) \,\|\, \gamma(z, \theta)/Z) \le \log Z. \tag{12}$$
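This objective may be optimized with stochastic gradient ascent
[Hoffman et al., 2013]

$$\lambda_{k+1} = \lambda_k + \rho_k \nabla_\lambda \mathcal{L}_\lambda \big|_{\lambda = \lambda_k}, \tag{13}$$

$$\nabla_\lambda \mathcal{L}_\lambda = \mathbb{E}_{q_\lambda(z, \theta)}\left[ \nabla_\lambda \log q_\lambda(z, \theta) \log \frac{\gamma(z, \theta)}{q_\lambda(z, \theta)} \right]. \tag{14}$$

Here $\rho_k$ is a sequence of step sizes that satisfies the
conditions $\sum_{k=1}^{\infty} \rho_k = \infty$ and
$\sum_{k=1}^{\infty} \rho_k^2 < \infty$. The calculation of the
gradient $\nabla_\lambda \mathcal{L}_\lambda$ requires an integral
over $q_\lambda$. For certain models, specifically those where the
likelihood and prior are in the conjugate exponential family [Hoffman
et al., 2013], this integral can be performed analytically.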

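Black box variational inference targets a much broader class of models
by sampling $z^{[n]}, \theta^{[n]} \sim q_\lambda$ and replacing the
gradient for each component $i$ with a sample-based estimate
[Ranganath et al., 2014]

$$\hat\nabla_{\lambda_i} \mathcal{L}_\lambda = \frac{1}{N} \sum_{n=1}^{N} \nabla_{\lambda_i} \log q_\lambda(z^{[n]}, \theta^{[n]}) \left( \log w^{[n]} - \hat{b}_i \right), \tag{15}$$

$$w^{[n]} = \gamma(z^{[n]}, \theta^{[n]}) / q_\lambda(z^{[n]}, \theta^{[n]}), \tag{16}$$

in which $\hat{b}_i$ is a control variate that reduces the variance of
the estimator

$$\hat{b}_i = \frac{\sum_{n=1}^{N} \left( \nabla_{\lambda_i} \log q_\lambda(z^{[n]}, \theta^{[n]}) \right)^2 \log w^{[n]}}{\sum_{n=1}^{N} \left( \nabla_{\lambda_i} \log q_\lambda(z^{[n]}, \theta^{[n]}) \right)^2}. \tag{17}$$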
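The estimator in Equations 15 to 17 can be sketched in a few lines of Python. The example below is a minimal illustration under assumptions of our own: the unnormalized target is a Gaussian bump centred at 3, the variational family is Normal(lam, 1), there are no separate parameters theta, and the step sizes follow a decaying schedule chosen for the example. The names log_gamma, log_q, bbvi_gradient, and fit are hypothetical.

```python
import math
import random


def log_gamma(z):
    # Unnormalized log target: Gaussian bump centred at 3 (toy assumption).
    return -0.5 * (z - 3.0) ** 2


def log_q(z, lam):
    # Log density of the variational family Normal(lam, 1).
    return -0.5 * (z - lam) ** 2 - 0.5 * math.log(2.0 * math.pi)


def bbvi_gradient(lam, n, rng):
    """Score-function gradient estimate with the variance-reducing baseline."""
    zs = [rng.gauss(lam, 1.0) for _ in range(n)]
    gs = [z - lam for z in zs]  # d/d lam of log q(z | lam)
    log_ws = [log_gamma(z) - log_q(z, lam) for z in zs]  # log of Eq. 16
    denom = sum(g * g for g in gs)
    baseline = sum(g * g * lw for g, lw in zip(gs, log_ws)) / denom  # Eq. 17
    return sum(g * (lw - baseline) for g, lw in zip(gs, log_ws)) / n  # Eq. 15


def fit(steps=200, n=100, seed=0):
    rng = random.Random(seed)
    lam = 0.0
    for k in range(steps):
        rho = 0.5 / math.sqrt(1.0 + k)  # decaying step size
        lam += rho * bbvi_gradient(lam, n, rng)
    return lam
```

Calling fit() moves lam from 0 towards 3, the mean of the normalized target, using only samples from q and evaluations of the two log densities.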


5 Black-box Policy Search 


The sample-based gradient estimator in BBVI resembles the one used in
classic likelihood-ratio policy gradient methods [Deisenroth et al.,
2011], such as REINFORCE [Williams, 1992], G(PO)MDP [Baxter and
Bartlett, 1999; Baxter et al., 1999], and PGT [Sutton et al., 1999].
There is in fact a close connection between BBVI and these methods, as
has been noted by e.g. Dayan et al. [1995], Mnih and Gregor [2014],
and Ba et al. [2014].


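To make this connection precise, let us consider what it would mean to
perform variational inference in a planning as inference setting. In
this case, we can define a lower bound
$\mathcal{L}_{\lambda, \lambda_0}$ on $\log Z_{\lambda_0}$ in terms of
a variational distribution $q_\lambda(\tau, \theta)$ with parameters
$\lambda$ and an unnormalized density
$\gamma_{\lambda_0}(\tau, \theta)$ of the form in Equation 5 with
parameters $\lambda_0$

$$\mathcal{L}_{\lambda, \lambda_0} = \mathbb{E}_{q_\lambda}[\log \gamma_{\lambda_0}(\tau, \theta) - \log q_\lambda(\tau, \theta)] \tag{18}$$

$$\phantom{\mathcal{L}_{\lambda, \lambda_0}} = \mathbb{E}_{q_\lambda}\left[ \beta R(\tau) + \log \frac{p_{\lambda_0}(\tau, \theta)}{q_\lambda(\tau, \theta)} \right]. \tag{19}$$

If we now choose a variational distribution with the same form as the
prior, then $q_\lambda(\tau, \theta) = p_{\lambda_0}(\tau, \theta)$
whenever $\lambda = \lambda_0$. Under this assumption, the lower bound
at $\lambda = \lambda_0$ simplifies to

$$\mathcal{L}_{\lambda, \lambda_0} \big|_{\lambda = \lambda_0} = \mathbb{E}_{q_\lambda}[\beta R(\tau)] \big|_{\lambda = \lambda_0} = \beta J_\lambda \big|_{\lambda = \lambda_0}. \tag{20}$$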


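In other words, the lower bound $\mathcal{L}_{\lambda, \lambda_0}$ is
proportional to the expected reward $J_\lambda$ when the variational
posterior is equal to the prior.

The gradient of the lower bound similarly simplifies to

$$\nabla_\lambda \mathcal{L}_{\lambda, \lambda_0} \big|_{\lambda = \lambda_0} = \mathbb{E}_{q_\lambda}\left[ \nabla_\lambda \log q_\lambda(\tau, \theta) \log \frac{\gamma_{\lambda_0}(\tau, \theta)}{q_\lambda(\tau, \theta)} \right] \bigg|_{\lambda = \lambda_0}$$

$$= \mathbb{E}_{q_{\lambda_0}}\left[ \nabla_\lambda \log q_\lambda(\tau, \theta)\, \beta R(\tau) \right] \Big|_{\lambda = \lambda_0}$$

$$= \int d\tau\, d\theta\, \nabla_\lambda q_\lambda(\tau, \theta)\, \beta R(\tau) \Big|_{\lambda = \lambda_0}$$

$$= \beta\, \nabla_\lambda J_\lambda \big|_{\lambda = \lambda_0}.$$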
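The simplification of the gradient relies on two elementary identities, which we restate here for convenience. The restatement assumes that $q_\lambda$ is differentiable in $\lambda$ and that differentiation and integration may be exchanged.

```latex
\begin{align*}
\log \frac{\gamma_{\lambda_0}(\tau,\theta)}{q_{\lambda}(\tau,\theta)}
  \bigg|_{\lambda=\lambda_0}
  &= \beta R(\tau)
   + \log \frac{p_{\lambda_0}(\tau,\theta)}{q_{\lambda_0}(\tau,\theta)}
   = \beta R(\tau), \\
q_{\lambda}(\tau,\theta)\, \nabla_\lambda \log q_{\lambda}(\tau,\theta)
  &= \nabla_\lambda q_{\lambda}(\tau,\theta).
\end{align*}
```

The first identity holds because the variational distribution coincides with the prior at $\lambda = \lambda_0$, so the log ratio of the two vanishes. The second converts the expectation of the score function into the gradient of an integral, which is the gradient of the expected reward up to the factor $\beta$.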


The implication of this identity is that we can perform gradient
ascent on $J_\lambda$ by making a slight modification to the update in
Equation 13

$$\lambda_{k+1} = \lambda_k + \rho_k \nabla_\lambda \mathcal{L}_{\lambda, \lambda_k} \big|_{\lambda = \lambda_k}. \tag{21}$$

The difference between these updates is that instead of calculating
the gradient estimate of $\mathcal{L}_{\lambda, \lambda_0}$ relative
to a fixed set of prior parameters $\lambda_0$, we update the
parameters of the prior $p_{\lambda_k}(\tau, \theta)$ after each
gradient step, and calculate the gradient
$\nabla_\lambda \mathcal{L}_{\lambda, \lambda_k}$. We note that the
constant $\beta$ is simply a scaling factor on the step sizes
$\rho_k$, and will from here on assume that $\beta = 1$.


When BBVI is performed using the update step in Equation 21, and the
variational family $q_\lambda$ is chosen to have the same form as the
prior $p_\lambda$, we obtain a procedure for EB estimation, which
maximizes the normalizing constant $Z_\lambda$ with respect to the
parameters $\lambda$ of the prior. The difference between the EB and
maximum likelihood (ML) methods is that the first calculates the
gradient relative to hyperparameters $\lambda$, whereas the other
calculates the gradient relative to the parameters $\theta$. Because
this difference relates only to the assumed model structure, EB
estimation is sometimes referred to as Type II maximum likelihood.


As is evident from Equation 20, EB estimation in the context of
planning as inference formulations maximizes the expected reward
$J_\lambda$. In the context of a probabilistic programming system this
means that we can effectively get three algorithms for the price of
one: If we can provide an implementation of BBVI, then this
implementation can be adapted to perform EB estimation, which in turn
allows us to perform policy search by simply defining models where the
exponentiated reward takes the place of the likelihood terms. This
results in a method that we call black box policy learning (BBPL),
which is equivalent to variants of REINFORCE applied to upper-level
policy search.


6 Learning Probabilistic Programs 


An implementation of BBVI and BBPL for probabilistic program inference
needs to address two domain-specific issues. The first is that
probabilistic programs need not always instantiate the same set of
random variables. The second is that we need to distinguish between
distributions that define model parameters $\theta$ and those that
define latent variables $z$, or variables that are part of the context
$x$ in the case of decision problems.


Let us refer back to the program in Figure 1. The function dfs-agent
performs a recursive loop until a stopping criterion is met: either
the target node is reached, or there are no more paths left to try. At
each step dfs-agent makes a call to policy, which is created by either
calling make-random-policy or make-edge-policy. A random policy
samples uniformly from unexplored directions. When depth first search
is performed with this policy, we are defining a model in which the
number of context variables is random, since the number of steps
required to reach the goal state will vary. In the case of the edge
policy, we use a memoized function to sample edge preference values as
needed, choosing the unexplored edge with the highest preference at
each step. In this case the number of parameter variables is random,
since we only instantiate preferences for edges that are (a) open, and
(b) connect to the current location of the agent.


As has been noted by Wingate and Weber [2013], BBVI can deal with
varying sets of random variables quite naturally. Since the gradient
is computed from a sample estimate, we can compute gradients for each
random variable by simply averaging over those executions in which the
variable exists. Sampling variables as needed can in fact be more
statistically efficient, since irrelevant variables that never affect
the trajectory of the agent will not contribute to the gradient
estimate. BBVI has the additional advantage of having relatively
light-weight implementation requirements; it only requires
differentiation of the log proposal density, which is a product over
primitive distributions of a limited number of types, for which
derivatives can be computed analytically. This is in contrast to
implementations based on (reverse-mode) automatic differentiation
[Pearlmutter and Siskind, 2008], as is used in Stan [Kucukelbir et
al., 2015], which store derivative terms for the entire computation
graph.


To provide a language-agnostic definition of BBVI and BBPL, we
formalize learning in probabilistic programs as the interaction
between a program $\mathcal{P}$ and an inference back end
$\mathcal{B}$. The program $\mathcal{P}$ represents all deterministic
steps in the computation and has internal state (e.g. its environment
variables). The back end $\mathcal{B}$ performs all inference-related
tasks.

A program $\mathcal{P}$ executes as normal, but delegates to the
inference back end whenever it needs to instantiate a random variable,
or evaluate a conditioning statement. The back end $\mathcal{B}$ then
supplies a value for the random variable, or makes note of the
probability associated with the conditioning statement, and then
delegates back to $\mathcal{P}$ to continue execution. We will assume
that the programming language provides some way to differentiate
between latent variables $z$, which are simply to be sampled, and
parameters $\theta$ for which a distribution is to be learned. In
Anglican the syntax (sample (tag :policy d)), as used in Figure 1, is
used as a general-purpose mechanism to label distributions on random
variables. An inference back end can simply ignore these labels, or
implement algorithm-specific actions for labeled subsets.

In order for the learning algorithm to be well-defined in programs
that instantiate varying numbers of random variables, we require that
each random variable $z_a$ is uniquely identified by an address $a$,
which may either be generated automatically by the language runtime,
or specified by the programmer. Each model parameter $\theta_b$ is
similarly identified by an address $b$.

In BBVI, the interface between a program $\mathcal{P}$ and the back
end $\mathcal{B}$ can be formalized with the following rules:

• Initially $\mathcal{B}$ calls $\mathcal{P}$ with no arguments
  $\mathcal{P}()$.

• A call to $\mathcal{P}$ returns one of four responses to
  $\mathcal{B}$:

  1. (sample, $a$, $f$, $\phi$): Identifies a latent random variable
     (not a policy parameter) $z_a$ with unique address $a$,
     distributed according to $f_a(\cdot \mid \phi_a)$. The back end
     generates a value $z_a \sim f_a(\cdot \mid \phi_a)$ and calls
     $\mathcal{P}(z_a)$.

  2. (learn, $b$, $f$, $\eta$): For policy parameters, the address $b$
     identifies a random variable $\theta_b$ in the model, distributed
     according to a distribution $f_b$ with parameters $\eta_b$. The
     back end generates $\theta_b \sim f_b(\cdot \mid \lambda_b)$
     conditioned on a learned variational parameter $\lambda_b$ and
     registers an importance weight
     $w_b = f_b(\theta_b \mid \eta_b) / f_b(\theta_b \mid \lambda_b)$.
     Execution continues by calling $\mathcal{P}(\theta_b)$.

  3. (factor, $c$, $l$): Here $c$ is a unique address for a factor
     with log probability $l_c$ and importance weight
     $w_c = \exp(l_c)$. Execution continues by calling
     $\mathcal{P}()$.

  4. (return, $v$): Execution completes, returning a value $v$.
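The interface rules above can be sketched with Python generators, where each yield plays the role of a response from the program to the back end and send resumes the program with a value. This is a minimal sketch under our own assumptions, not the Anglican implementation: the Normal class, the toy program, and the function run are hypothetical, distribution parameters are single numbers, and the (return, v) response is represented by the generator's return value.

```python
import math
import random


class Normal:
    """Unit-variance normal distribution; its only parameter is the mean."""

    def __init__(self, mean):
        self.mean = mean

    def sample(self, rng):
        return rng.gauss(self.mean, 1.0)

    def log_prob(self, x):
        return -0.5 * (x - self.mean) ** 2 - 0.5 * math.log(2.0 * math.pi)


def program():
    # Hypothetical toy program: one learned parameter, one latent variable,
    # and one factor that rewards latent values close to 2.
    theta = yield ("learn", "theta", Normal, 0.0)  # eta = 0.0
    z = yield ("sample", "z", Normal, theta)
    yield ("factor", "reward", -(z - 2.0) ** 2)
    return z


def run(program, lam, rng):
    """Execute the program once; return (value, thetas, log importance weight)."""
    gen = program()
    log_w = 0.0
    thetas = {}
    try:
        msg = next(gen)  # initial call with no arguments
        while True:
            kind = msg[0]
            if kind == "sample":
                _, addr, family, phi = msg
                msg = gen.send(family(phi).sample(rng))
            elif kind == "learn":
                _, addr, family, eta = msg
                lam_b = lam.setdefault(addr, eta)  # initialize lambda_b to eta_b
                value = family(lam_b).sample(rng)
                log_w += family(eta).log_prob(value) - family(lam_b).log_prob(value)
                thetas[addr] = value
                msg = gen.send(value)
            elif kind == "factor":
                _, addr, log_p = msg
                log_w += log_p
                msg = next(gen)
            else:
                raise ValueError(f"unknown response: {kind}")
    except StopIteration as stop:
        return stop.value, thetas, log_w
```

On the first execution the variational parameter for each address is initialized to the value supplied by the program, so the learn step contributes a log weight of zero and the total log weight equals the factor term.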
Because each call to $\mathcal{P}$ is deterministic, an execution
history is fully characterized by the values for each random variable
that are generated by $\mathcal{B}$. However, the set of random
variables that is instantiated may vary from execution to execution.
We write $A$, $B$, $C$ for the set of addresses of each type visited
in a given execution. The program $\mathcal{P}$ now defines an
unnormalized density $\gamma_{\mathcal{P}}$ of the form

$$\gamma_{\mathcal{P}}(z, \theta) := p_{\mathcal{P}}(z, \theta) \prod_{c \in C} \exp(l_c), \tag{22}$$

$$p_{\mathcal{P}}(z, \theta) := \prod_{a \in A} f_a(z_a \mid \phi_a) \prod_{b \in B} f_b(\theta_b \mid \eta_b). \tag{23}$$

Implicit in this notation is the fact that the distribution types
$f_a(\cdot \mid \phi_a)$ and $f_b(\cdot \mid \eta_b)$ are return
values from calls to $\mathcal{P}$, which implies that both the
parameter values and the distribution type may vary from execution to
execution. While $f_a(\cdot \mid \phi_a)$ and
$f_b(\cdot \mid \eta_b)$ are fully determined by preceding values for
$z$ and $\theta$, we assume they are opaque to the inference
algorithm, in the sense that no analysis is performed to characterize
the conditional dependence of each $\phi_a$ or $\eta_b$ on other
random variables in the program.

Given the above definition of a target density
$\gamma_{\mathcal{P}}(z, \theta)$, we are now in a position to define
the density of a variational approximation $\mathcal{Q}_\lambda$ to
the program. In this density, the runtime values $\eta_b$ are replaced
by variational parameters $\lambda_b$

$$p_{\mathcal{Q}_\lambda}(z, \theta) := \prod_{a \in A} f_a(z_a \mid \phi_a) \prod_{b \in B} f_b(\theta_b \mid \lambda_b). \tag{24}$$

This density corresponds to that of a mean-field probabilistic
program, where the dependency of each $\theta_b$ on other random
variables is ignored.

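Repeated execution of $\mathcal{P}$ given the interface described
above results in a sequence of weighted samples
$(w^{[n]}, \theta^{[n]}, z^{[n]})$, whose importance weight is defined
as

$$w^{[n]} := \frac{\gamma_{\mathcal{P}}(z^{[n]}, \theta^{[n]})}{p_{\mathcal{Q}_\lambda}(z^{[n]}, \theta^{[n]})} = \prod_{b \in B} \frac{f_b(\theta_b^{[n]} \mid \eta_b)}{f_b(\theta_b^{[n]} \mid \lambda_b)} \prod_{c \in C} \exp l_c^{[n]}. \tag{25}$$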

With this notation in place, it is clear that we can define a lower
bound $\mathcal{L}_{\mathcal{Q}_\lambda, \mathcal{Q}_{\lambda_k}}$
analogous to that of Equation 19, and a gradient estimator analogous
to that of Equation 15, in which the latent variables $z$ take the
role of the trajectory variables $\tau$. In summary, we can describe a
sequential decision problem as a probabilistic program $\mathcal{P}$
in which the log probabilities $l_c$ are interpreted as rewards,
parameters $\theta_b$ define the policy, and all other latent
variables $z_a$ are trajectory variables. EB inference can then be
used to learn the hyperparameters $\lambda$ that maximize the expected
reward, as described in Algorithm 1.


Algorithm 1 Black-box Policy Learning

  initialize parameters $\lambda_{0,b} \leftarrow \eta_b$, iteration $k = 0$
  repeat
    Set initial $\lambda_{k+1} = \{\lambda_{k,b}\}_{b \in B}$
    Run $N$ executions of program $\mathcal{Q}_{\lambda_k}$, generating
      $(w^{[n]}, \theta^{[n]}, z^{[n]})$ according to Eqns. 24, 25
    for each address $b$ do
      Let $N_b \le N$ be the number of runs containing $b$
      $g^{[n]}_{\lambda_{k,b}} := \nabla_{\lambda_{k,b}} \log f(\theta_b^{[n]} \mid \lambda_{k,b})$
      Compute baseline $\hat{b}_{\lambda_{k,b}}$ from Eq. 17
      $\hat\nabla_{\lambda_{k,b}} J_{\lambda_k} \leftarrow N_b^{-1} \sum_n g^{[n]}_{\lambda_{k,b}} (\log w^{[n]} - \hat{b}_{\lambda_{k,b}})$
      Update $\lambda_{k+1,b} \leftarrow \lambda_{k,b} + \rho_k \hat\nabla_{\lambda_{k,b}} J_{\lambda_k}$
    end for
    $k \leftarrow k + 1$
  until parameters $\lambda_b$ converge
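The inner loop of Algorithm 1 can be sketched as follows. This is a minimal Python illustration under assumptions of our own: every learned parameter has a unit-variance normal distribution whose mean is the variational parameter, so the gradient of the log density is theta minus the mean, and each execution is summarized as a pair of a log importance weight and a dictionary of the parameter values it instantiated. The function name bbpl_step and this data layout are hypothetical.

```python
def bbpl_step(runs, lam, rho):
    """One parameter update for unit-variance normal parameter distributions.

    runs: list of (log_w, thetas) pairs, where thetas maps address -> value.
    lam:  dict mapping address -> current variational mean.
    rho:  step size for this iteration.
    """
    new_lam = dict(lam)
    for b in lam:
        # Only executions that instantiated address b contribute (N_b runs).
        hits = [(log_w, thetas[b]) for log_w, thetas in runs if b in thetas]
        if not hits:
            continue
        gs = [theta - lam[b] for _, theta in hits]  # grad of log N(theta | lam_b, 1)
        log_ws = [log_w for log_w, _ in hits]
        denom = sum(g * g for g in gs)
        # Baseline from Eq. 17, restricted to the runs containing b.
        baseline = (
            sum(g * g * lw for g, lw in zip(gs, log_ws)) / denom if denom > 0 else 0.0
        )
        grad = sum(g * (lw - baseline) for g, lw in zip(gs, log_ws)) / len(hits)
        new_lam[b] = lam[b] + rho * grad
    return new_lam


# Two executions: both instantiate "a", only the second instantiates "b".
runs = [(1.0, {"a": 1.0}), (0.0, {"a": -1.0, "b": 0.5})]
updated = bbpl_step(runs, {"a": 0.0, "b": 0.0}, rho=0.1)
```

In this example the mean at address "a" moves towards the value that received the higher weight, while address "b" is left unchanged because a single run gives a baseline equal to its own log weight.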

An assumption that we made when deriving BBPL is that the variational
distribution $q_\lambda(\tau, \theta)$ must have the same analytical
form as the prior $p_{\lambda_0}(\tau, \theta)$. Practically this
requirement means that a program $\mathcal{P}$ must be written in such
a way that the values of the hyperparameters $\eta_b$ have the same
constant values in every execution, since their values may not depend
on those of random variables. One way to enforce this is to pass
$\eta$ as a parameter in the initial call $\mathcal{P}(\eta)$ by
$\mathcal{B}$, though we do not formalize such a requirement here.


7 Case Studies 


We demonstrate the use of programs for policy search in three problem
domains: (1) the Canadian Traveler Problem, (2) a modified version of
the RockSample POMDP, and (3) an optimal diagnosis benchmark inspired
by the classic children's game Guess Who.

These three domains are examples of deterministic POMDPs, in which the
initial state of the world is not known, and observations may be
noisy, but the state transitions are deterministic. Even for discrete
variants of such problems, the number of possible information states
$x_t = (u_0, o_1, \ldots, u_{t-1}, o_t)$ grows exponentially with the
horizon $T$, meaning that it is not possible to fully parameterize a
distribution $\pi(u \mid x, \theta)$ in terms of a conditional
probability table $\theta_{x,u}$. In our probabilistic program
formulations for these problems, the agent is modeled as an algorithm
with a number of random parameters, and we use BBPL to learn the
distribution on parameters that maximizes the reward.
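To give a sense of scale, consider a rough count under the assumption of finite action and observation sets, written $U$ and $O$ here purely for illustration. An information state of length $t$ contains $t$ actions and $t$ observations, so there are at most

```latex
\[
  (|U|\,|O|)^{t}
  \quad\text{information states of length } t,
  \qquad\text{e.g.}\quad
  |U| = 4,\ |O| = 2,\ t = 10
  \;\Rightarrow\;
  8^{10} = 1{,}073{,}741{,}824 .
\]
```

A conditional probability table would need one row per information state, which is why a structured, algorithmic policy with a small number of parameters is used instead.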


We implement our case studies using the probabilistic programming
system Anglican [Wood et al., 2014]. We use the same experimental
setup in each of the three domains. A trial begins with a learning
phase, in which BBPL is used to learn the policy hyperparameters,
followed by a number of testing episodes in which the agent chooses
actions according to a fixed learned policy. At each gradient update
step, we use 1000 samples to calculate a gradient estimate. Each
testing phase consists of 1000 episodes. All shown results are based
on test-phase simulations.


Figure 2: Convergence for CTP domains of 20 and 50 nodes. Blue lines
show the mean traveled distance using the learned policy, averaged
over 5 domains. Red lines show the mean traveled distance for the
optimistic heuristic policy. Dash length indicates the fraction of
open edges, which ranges from 1.0 to 0.6.


Stochastic gradient methods can be sensitive to the learning rate
parameters. Results reported here use an RMSProp style rescaling of
the gradient [Hinton et al.], which normalizes the gradient by a
rolling average of its magnitude with decay factor 0.9. We use a step
size schedule $\rho_k = \rho_0 / (\tau + k)^{\kappa}$ as reported in
Hoffman et al. [2013], with $\tau = 1$, $\kappa = 0.5$ in all
experiments. We use a relatively conservative base learning rate
$\rho_0 = 0.1$ in all reported experiments. For independent trials
performed across a range 1, 2, 5, 10, ..., 1000 of total gradient
steps, consistent convergence was observed in all runs using over 100
gradient steps.
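The step size schedule and gradient rescaling can be sketched as below. This is a minimal Python illustration under stated assumptions: the schedule is read as rho_0 / (tau + k) ** kappa, and the rescaling follows the common RMSProp form that divides by the square root of an exponentially decaying average of squared gradients. The small constant eps and the names step_size and RMSPropScaler are choices made for this example.

```python
import math


def step_size(k, rho0=0.1, tau=1.0, kappa=0.5):
    """Decaying step size rho_k = rho0 / (tau + k) ** kappa."""
    return rho0 / (tau + k) ** kappa


class RMSPropScaler:
    """Rescale a scalar gradient by a decaying average of its squared value."""

    def __init__(self, decay=0.9, eps=1e-8):
        self.decay = decay
        self.eps = eps  # guards against division by zero (assumed value)
        self.mean_sq = 0.0

    def rescale(self, grad):
        self.mean_sq = self.decay * self.mean_sq + (1.0 - self.decay) * grad * grad
        return grad / (math.sqrt(self.mean_sq) + self.eps)


scaler = RMSPropScaler()
lam = 0.0
for k, grad in enumerate([2.0, 1.5, 1.0]):
    lam += step_size(k) * scaler.rescale(grad)
```

Because the rescaled gradient has roughly unit magnitude, the base learning rate controls the size of each update largely independently of the scale of the rewards.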


The source code for the case studies, as well as the BBPL
implementation, is available online at
https://bitbucket.org/probprog/black-box-policy-search


7.1 Canadian Traveler Problem


In the Canadian Traveler Problem (CTP) [Papadimitriou and Yannakakis,
1991], an agent must traverse a graph $G = (V, E)$, in which edges may
be missing at random. It is assumed the agent knows the distance
$d : E \to \mathbb{R}^{+}$ associated with each edge, as well as the
probability $p : E \to (0, 1]$ that the edge is open, but has no
advance knowledge of the edges that are blocked. The problem is
NP-hard [Fried et al., 2013], and heuristic online and offline
approaches [Eyerich et al., 2010] are used to solve problem instances.


The results in Figure 1 show that the learned policy behaves in a
reasonable manner. When edges are open with high probability, the
policy takes the shortest path from the start node, marked in green,
to the target node, marked in red. As the fraction of closed edges
increases, the policy makes more frequent use of alternate routes.
Note that each edge has a fixed probability of being open in our
set-up, resulting in a preference for routes that traverse fewer
edges.


Figure 2 shows convergence as a function of the number of gradient
steps. Results are averaged over 5 domains of 20 and 50 nodes
respectively. Convergence plots for each individual domain can be
found in the supplementary material. We compare the learned policies
against the optimistic policy, a heuristic that selects edges
according to the shortest path, assuming that all unobserved edges are
open. We observe that mean traveled distance for the learned policy
converges to that of the optimistic policy, which is close to optimal.


Figure 3: Learned policies for the Rock Sample domain. Edge weights
indicate the frequency at which the agent moves between each pair of
rocks. Starting points are in green, exit paths in red.


7.2 RockSample POMDP 


In the RockSample POMDP [Smith and Simmons, 2004], an N x N square
field with M rocks is given. A rover is initially located in the
middle of the left edge of the square. Each of the rocks can be either
good or bad; the rover must traverse the field and collect samples of
good rocks while minimizing the traveled distance. The rover can sense
the quality of a rock remotely with an accuracy decreasing with the
distance to the rock. We consider a finite-horizon variant of the
RockSample domain, described in the supplementary material, with a
structured policy in which a robot travels along rocks in a
left-to-right order.

The policy plots in Figure 3 show that this simple policy results in
sensible movement preferences. In particular we point out that in the
5 x 5 instance, the agent always visits the top-left rock when
traveling to the top-middle rock, since doing so incurs no additional
cost. Similarly, the agent follows an almost deterministic trajectory
along the left-most 5 rocks in the 10 x 10 instance, but does not
always make the detour towards the lower rocks afterwards.


Figure 4: (left) Average reward in Guess Who as a function of the
number of questions. (right) Convergence of rewards as a function of
the number of gradient steps. Each dot marks an independent restart.


7.3 Guess Who


Guess Who is a classic game in which players pick a
card depicting a face, belonging to a set that is known
to both players. The players then take turns asking
questions until they identify the card of the other player
[Coster and Coster, 1979]. We here consider a single-player
setting where an agent asks a pre-determined
number of questions, but the responses are inaccurate
with some probability. This is sometimes known as a
measurement selection, or optimal diagnosis problem.
We make use of a feature set based on the original
game, consisting of 24 individuals, characterized by
11 binary attributes and two multi-class attributes,
resulting in a total of 19 possible questions. We assume
a response accuracy of 0.9. By design, the structure
of the domain is such that there is no clear winning
opening question. However the best question at any
point is highly contextual.


We assume that the agent knows the reliability of the
response and has an accurate representation of the
posterior belief $b_t(s) = p(s \mid x_t)$ for each candidate $s$,
given questions and responses. The agent selects
randomly among the highest ranked candidates after
the final question. We consider 3 policy variants, two
of which are parameter-free baselines. In the first baseline,
questions are asked uniformly at random. In
the second, questions are asked according to a myopic
estimate of the value of information [Hay et al.,
2012], i.e. the change in expected reward relative to
the current best candidates, which is myopically optimal
in this setting. Finally, we consider a policy
that empirically samples questions $q$ according to a
weight $v_q = \gamma^{n_q} (Ab)_q$, based on the current belief $b$, a
weight matrix $A$, and a discount factor $\gamma^{n_q}$ based on the
number of times $n_q$ a question was previously asked. Intuitively,
this algorithm can be understood as learning
a small set of $\alpha$-vectors, one for each question, similar
to those learned in point-based value iteration [Pineau
et al., 2003]. The discounting effectively "shrinks" the
belief-space volume associated with the $\alpha$-vector of the
current best question, allowing the agent to select the
next-best question.


The results in Figure 4 show that the learned policy
clearly outperforms both baselines, which is a surprising
result given the complexity of the problem and the
relatively simplistic form of this heuristic policy. While
these results should not be expected to be in any way
optimal, they are encouraging in that they illustrate
how probabilistic programming can be used to implement
and test policies that rely on transformations
of the belief or information state in a straightforward
manner.


8 Discussion


In this paper we put forward the idea that probabilistic 
programs can be a productive medium for describing 
both a problem domain and the agent in sequential 
decision problems. Programs can often incorporate 
assumptions about the structure of a problem domain 
to represent the space of policies in a more targeted 
manner, using a much smaller number of variables than 
would be needed in a more general formulation. By 
combining probabilistic programming with black-box 
variational inference we obtain a generalized variant of 
well-established policy gradient techniques that allow us 
to define and learn policies with arbitrary levels of algorithmic
sophistication in moderately high-dimensional
parameter spaces. Fundamentally, policy programs represent
some form of assumptions about what contextual
information is most relevant to a decision, whereas the 
policy parameters represent domain knowledge that 
generalizes across episodes. This suggests future work 
to explore how latent variable models may be used to 
represent past experiences in a manner that can be 
related to the current information state. 
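
The connection to policy gradient techniques mentioned above rests on a score-function (likelihood-ratio) estimate of the gradient of the expected reward with respect to the policy parameters. The Python sketch below illustrates that estimator on a deliberately tiny, hypothetical problem: a single Bernoulli decision with logit parameter `theta`, where action 1 yields reward 1 and action 0 yields reward 0. The toy reward, the step size, the plain gradient ascent update, and the batch-mean baseline are assumptions made for illustration; this is not the update rule or optimizer configuration used in the experiments.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def reward(action):
    # Hypothetical toy reward: action 1 is always better.
    return 1.0 if action == 1 else 0.0


def score_function_gradient(theta, num_samples, rng):
    """Monte Carlo estimate of d/dtheta E[reward], a ~ Bernoulli(sigmoid(theta)).

    Averages (r - baseline) * d/dtheta log p(a | theta) over sampled actions,
    using the batch-mean reward as a simple variance-reducing baseline.
    """
    p = sigmoid(theta)
    actions = [1 if rng.random() < p else 0 for _ in range(num_samples)]
    rewards = [reward(a) for a in actions]
    baseline = sum(rewards) / num_samples
    # For a Bernoulli with logit theta: d/dtheta log p(a | theta) = a - p.
    terms = [(r - baseline) * (a - p) for a, r in zip(actions, rewards)]
    return sum(terms) / num_samples


def learn(num_steps=200, num_samples=50, step_size=0.5, seed=0):
    """Plain stochastic gradient ascent on the expected reward."""
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(num_steps):
        theta += step_size * score_function_gradient(theta, num_samples, rng)
    return theta
```

Calling `sigmoid(learn())` gives the learned probability of choosing the better action. Because the baseline is computed from the same batch, the estimate is slightly biased (by a factor of (N-1)/N for this toy reward); a baseline computed from independent samples would avoid this.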















Jan-Willem van de Meent, Brooks Paige, David Tolpin, Frank Wood


Acknowledgements 

We would like to thank Thomas Keller for his assistance
with the Canadian traveler problem, and Rajesh
Ranganath for helpful feedback on configuring RMSProp
for black-box variational inference. Frank Wood
is supported under DARPA PPAML through the U.S.
AFRL under Cooperative Agreement number
FA8750-14-2-0006, Sub Award number 61160290-111668.

References 

D. Andre and S. J. Russell. State Abstraction for Programmable
Reinforcement Learning Agents. In AAAI,
2002.

J. Ba, V. Mnih, and K. Kavukcuoglu. Multiple object
recognition with visual attention. In Proceedings of the
International Conference on Learning Representations,
2014. arXiv:1412.7755.

J. Baxter and P. Bartlett. Direct gradient-based reinforcement
learning: I. Gradient estimation algorithms. Technical
report, Computer Sciences Laboratory, Australian
National University, 1999.

J. Baxter, L. Weaver, and P. Bartlett. Direct gradient-based
reinforcement learning: II. Gradient ascent algorithms
and experiments. Technical report, Computer Sciences
Laboratory, Australian National University, 1999.

T. Coster and O. Coster. Guess Who? http://theoradesign.com,
1979.

P. Dayan, G. E. Hinton, R. M. Neal, and R. S. Zemel. The
Helmholtz machine. Neural Computation, 7(5):889-904,
1995.

M. P. Deisenroth, G. Neumann, and J. Peters. A Survey
on Policy Search for Robotics. Foundations and Trends
in Robotics, 2(2011):1-142, 2011.

P. Eyerich, T. Keller, and M. Helmert. High-quality policies
for the Canadian traveler’s problem. In AAAI, 2010.

D. Fried, S. E. Shimony, A. Benbassat, and C. Wenner.
Complexity of Canadian traveler problem variants. Theor.
Comput. Sci., 487:1-16, 2013.

N. Goodman, V. Mansinghka, D. M. Roy, K. Bonawitz,
and J. B. Tenenbaum. Church: a language for generative
models. In Uncertainty in Artificial Intelligence, pages
220-229, 2008.

A. D. Gordon, T. A. Henzinger, A. V. Nori, and S. K.
Rajamani. Probabilistic programming. In International
Conference on Software Engineering (ICSE, FOSE track),
2014.

N. Hay, S. Russell, D. Tolpin, and S. Shimony. Selecting
Computations: Theory and Applications. In Uncertainty
in Artificial Intelligence, 2012.

G. Hinton, N. Srivastava, and K. Swersky.
http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf

M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley.
Stochastic variational inference. Journal of Machine
Learning Research, 14:1303-1347, 2013.

M. W. Hoffman, N. de Freitas, A. Doucet, and J. R. Peters.
An expectation maximization algorithm for continuous
Markov Decision Processes with arbitrary reward. In
International Conference on Artificial Intelligence and
Statistics, pages 232-239, 2009a.

M. W. Hoffman, H. Kueck, N. de Freitas, and A. Doucet.
New inference strategies for solving Markov decision
processes using reversible jump MCMC. In Uncertainty
in Artificial Intelligence, pages 223-231. AUAI Press,
2009b.

H. J. Kappen. Path integrals and symmetry breaking for
optimal control theory. Journal of Statistical Mechanics:
Theory and Experiment, 2005(11), 2005. P11011.

L. Kocsis and C. Szepesvari. Bandit based Monte-Carlo
planning. In European Conference on Machine Learning,
pages 282-293, 2006.

A. Kucukelbir, R. Ranganath, A. Gelman, and D. M. Blei.
Automatic Variational Inference in Stan. Neural Information
Processing Systems, 2015.

S. Levine and V. Koltun. Guided Policy Search. In International
Conference on Machine Learning, volume 28,
pages 1-9, 2013.

V. Mansinghka, D. Selsam, and Y. Perov. Venture: a
higher-order probabilistic programming platform with
programmable inference. arXiv preprint arXiv:1404.0099,
2014.

J. S. Maritz and T. Lwin. Empirical Bayes methods, volume 35.
Chapman and Hall, London, 1989. ISBN
0412277603.

B. Milch, B. Marthi, S. Russell, D. Sontag, D. L. Ong, and
A. Kolobov. Blog: Probabilistic models with unknown
objects. Statistical relational learning, page 373, 2007.

T. Minka, J. Winn, J. Guiver, S. Webster, Y. Zaykov,
B. Yangel, A. Spengler, and J. Bronskill. Infer.NET 2.6,
2014. Microsoft Research Cambridge,
http://research.microsoft.com/infernet.

A. Mnih and K. Gregor. Neural variational inference and
learning in belief networks. In Proceedings of The 31st
International Conference on Machine Learning, pages
1791-1799, 2014.

G. Neumann. Variational Inference for Policy Search in
Changing Situations. In International Conference on
Machine Learning, 2011.

D. Nitti, V. Belle, and L. De Raedt. Planning in Discrete
and Continuous Markov Decision Processes by Probabilistic
Programming. In ECML PKDD, Lecture Notes in
Computer Science, pages 327-342, Cham, 2015. Springer
International Publishing.

C. H. Papadimitriou and M. Yannakakis. Shortest paths
without a map. Theor. Comput. Sci., 84(1):127-150, July
1991.

B. A. Pearlmutter and J. M. Siskind. Using programming
language theory to make automatic differentiation sound
and efficient. Advances in Automatic Differentiation,
pages 79-90, 2008.

A. Pfeffer. Figaro: An object-oriented probabilistic programming
language. Technical report, 2009.

J. Pineau, G. Gordon, and S. Thrun. Point-based value
iteration: An anytime algorithm for POMDPs. In International
Joint Conference on Artificial Intelligence,
pages 1025-1030, 2003.





R. Ranganath, S. Gerrish, and D. M. Blei. Black Box Variational
Inference. In Artificial Intelligence and Statistics,
2014.

K. Rawlik, M. Toussaint, and S. Vijayakumar. On Stochastic
Optimal Control and Reinforcement Learning by Approximate
Inference. On Stochastic Optimal Control and
Reinforcement Learning by Approximate Inference, (2),
2012.

T. Smith and R. Simmons. Heuristic Search Value Iteration
for POMDPs. In Uncertainty in Artificial Intelligence,
pages 520-527, Arlington, Virginia, United States, 2004.
AUAI Press.

S. Srivastava, S. Russell, P. Ruan, and X. Cheng. First-Order
Open-Universe POMDPs. In Uncertainty in Artificial
Intelligence, 2014.

R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy
Gradient Methods for Reinforcement Learning with
Function Approximation. Neural Information Processing
Systems, pages 1057-1063, 1999.

E. Todorov. Efficient computation of optimal actions. Proc.
Nat. Acad. Sci. of America, 106(28):11478-11483, 2009.

M. Toussaint, S. Harmeling, and A. Storkey. Probabilistic
inference for solving (PO)MDPs. Neural Computation,
31(December):357-373, 2006.

M. J. Wainwright and M. I. Jordan. Graphical Models,
Exponential Families, and Variational Inference. Foundations
and Trends in Machine Learning, 1(1-2):1-305,
2008.

R. J. Williams. Simple statistical gradient-following algorithms
for connectionist reinforcement learning. Machine
Learning, 8(3-4):229-256, 1992.

D. Wingate and T. Weber. Automated variational inference
in probabilistic programming. arXiv preprint
arXiv:1301.1299, 2013.

D. Wingate, N. D. Goodman, D. M. Roy, L. P. Kaelbling,
and J. B. Tenenbaum. Bayesian policy search with policy
priors. In International Joint Conference on Artificial
Intelligence, pages 1565-1570, 2011.

D. Wingate, C. Diuk, T. O. Donnell, J. Tenenbaum, S. Gershman,
L. Labs, and J. B. Tenenbaum. Compositional
Policy Priors. Technical report, Computer Science and
Artificial Intelligence Laboratory, Cambridge, MA, 2013.

F. Wood, J. van de Meent, and V. Mansinghka. A new
approach to probabilistic programming inference. In
Artificial Intelligence and Statistics, pages 1024-1032,
2014.





A Anglican 

All case studies are implemented in Anglican, a probabilistic programming language that is closely integrated into the 
Clojure language. In Anglican, the macro defquery is used to define a probabilistic model. Programs may make use of 
user-written Clojure functions (defined with defn) as well as user-written Anglican functions (defined with defm). The 
difference between the two is that Anglican functions may make use of the model special forms sample, observe, and
predict, which interrupt execution and require action by the inference back end. In Clojure functions, sample is a primitive 
procedure that generates a random value, observe returns a log probability, and predict is not available. 

Full documentation for Anglican can be found at 
http://www.robots.ox.ac.uk/~fwood/anglican 


The complete source code for the case studies can be found at 
https://bitbucket.org/probprog/black-box-policy-search 


B Canadian Traveler Problem 

The complete results for the Canadian traveler problem, showing the performance and convergence for the learned policies 
for multiple graphs of different sizes and topologies, are presented in Figures 5 and 6.

C RockSample 

The RockSample problem was formulated as a benchmark for value iteration algorithms and is normally evaluated in 
an infinite horizon setting where the discount factor penalizes sensing and movement. In the original formulation of the 
problem, movement and sensing incur no cost. The agent gets a reward of 10 for each good rock, as well as for reaching 
the right edge, but incurs a penalty of -10 when sampling a bad rock. 

Here we consider an adaptation of RockSample to a finite horizon setting. We assume sensing is free, and movement 
incurs a cost of -1. We structure the policy by moving along rocks in a left-to-right order. At each rock the agent senses the
closest next rock and chooses to move to it, or discard it and consider the next closest rock. When the agent gets to a 
rock, it only samples the rock if the rock is good. The parameters describe the prior over the probability of moving to a 
rock conditioned on the current location and the sensor reading. 
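
The structured policy described above can be sketched as a short rollout in Python. Several details are assumptions made only for illustration: the sensor accuracy formula in `sensor_accuracy`, the use of Manhattan distance, the reading of "closest next rock" as the nearest rock that is not to the left of the agent, a final reward of 10 for reaching the right edge, and a move probability `move_prob` that depends on the sensor reading alone. In the text the parameters are priors over such probabilities and also condition on the current location, and the horizon is finite; neither is modeled here.

```python
import random


def sensor_accuracy(distance, half_distance=4.0):
    # Assumed form: perfect at distance 0, decaying towards chance (0.5).
    return 0.5 + 0.5 * 2.0 ** (-distance / half_distance)


def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def rollout(size, rocks, move_prob, rng):
    """Simulate one episode of a left-to-right structured policy.

    rocks: list of ((x, y), is_good) pairs.
    move_prob: dict mapping a sensor reading (True means "good") to the
        probability of moving to the sensed rock.
    Returns the total reward of the episode.
    """
    position = (0, size // 2)  # middle of the left edge
    remaining = list(rocks)
    total = 0.0
    while True:
        # Candidates are unconsidered rocks that are not to the left.
        candidates = [r for r in remaining if r[0][0] >= position[0]]
        candidates.sort(key=lambda r: manhattan(position, r[0]))
        target = None
        for rock in candidates:
            location, is_good = rock
            remaining.remove(rock)  # each rock is considered only once
            accuracy = sensor_accuracy(manhattan(position, location))
            reading = is_good if rng.random() < accuracy else not is_good
            if rng.random() < move_prob[reading]:
                target = rock
                break
        if target is None:
            # No rock selected: head straight for the right edge.
            total -= (size - 1) - position[0]
            return total + 10.0
        location, is_good = target
        total -= manhattan(position, location)  # movement cost of -1 per step
        position = location
        if is_good:
            total += 10.0  # the agent samples only good rocks
```

For example, `rollout(5, rocks, {True: 0.9, False: 0.1}, random.Random(0))` simulates an agent that mostly trusts its sensor. Policy learning would adjust the entries of `move_prob` to increase the average return over many rollouts.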

D Guess Who 

In Table 1 we provide as reference the complete ontology for the Guess Who domain. At each turn, the player asks whether
the unknown individual has a particular value of a single attribute. 
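
As a rough illustration of the quantities involved in this domain, the Python sketch below performs a Bayesian belief update for a noisy yes/no answer and computes question-sampling probabilities from weights of the form v_q = gamma^{n_q} (A b)_q, which is one reading of the learned policy in Section 7.3. The function names, the toy inputs in the usage note, and the normalization of the weights into probabilities (which assumes they are nonnegative and not all zero) are hypothetical; only the response accuracy of 0.9 is taken from the text.

```python
ACCURACY = 0.9  # response accuracy stated in Section 7.3


def update_belief(belief, has_value, answer, accuracy=ACCURACY):
    """Posterior over candidates after a noisy yes/no answer.

    belief: list of prior probabilities, one per candidate.
    has_value: list of booleans, whether each candidate has the attribute
        value that the question asks about.
    answer: the (possibly wrong) boolean response.
    """
    likelihood = [accuracy if truth == answer else 1.0 - accuracy
                  for truth in has_value]
    unnormalized = [b * l for b, l in zip(belief, likelihood)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def question_probabilities(belief, weights, counts, gamma):
    """Normalize v_q = gamma**n_q * (A b)_q into sampling probabilities.

    weights: the matrix A as a list of rows, one row per question.
    counts: n_q, the number of times each question was already asked.
    """
    values = [gamma ** n * sum(a * b for a, b in zip(row, belief))
              for row, n in zip(weights, counts)]
    total = sum(values)
    return [v / total for v in values]
```

With a uniform belief over three candidates, of which only the first has the queried value, a "yes" answer gives `update_belief([1/3, 1/3, 1/3], [True, False, False], True)`, which concentrates the belief on the first candidate. Asking the same question again lowers its weight through the factor `gamma ** n`.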
Figure 5: Canadian traveler problem: edge weights, indicating average travel frequency under the learned policy,
and convergence for individual instances with 20 nodes.

Figure 6: Canadian traveler problem: edge weights, indicating average travel frequency under the learned policy,
and convergence for individual instances with 50 nodes.